Evaluating Datalog via Tree Automata and Cycluits
We investigate parameterizations of both database instances and queries that
make query evaluation fixed-parameter tractable in combined complexity. We show
that clique-frontier-guarded Datalog with stratified negation (CFG-Datalog)
enjoys bilinear-time evaluation on structures of bounded treewidth for programs
of bounded rule size. Such programs capture in particular conjunctive queries
with simplicial decompositions of bounded width, guarded negation fragment
queries of bounded CQ-rank, or two-way regular path queries. Our result is
shown by translating to alternating two-way automata, whose semantics is
defined via cyclic provenance circuits (cycluits) that can be tractably
evaluated.
Comment: 56 pages, 63 references. Journal version of "Combined Tractability of
Query Evaluation via Tree Automata and Cycluits (Extended Version)" at
arXiv:1612.04203. Up to the stylesheet, page/environment numbering, and
possible minor publisher-induced changes, this is the exact content of the
journal paper that will appear in Theory of Computing Systems. Update wrt
version 1: latest reviewer feedback
The Hidden Web, XML and Semantic Web: A Scientific Data Management Perspective
The World Wide Web no longer consists just of HTML pages. Our work sheds
light on a number of trends on the Internet that go beyond simple Web pages.
The hidden Web provides a wealth of data in semi-structured form, accessible
through Web forms and Web services. These services, as well as numerous other
applications on the Web, commonly use XML, the eXtensible Markup Language. XML
has become the lingua franca of the Internet that allows customized markups to
be defined for specific domains. On top of XML, the Semantic Web grows as a
common structured data source. In this work, we first explain each of these
developments in detail. Using real-world examples from scientific domains of
great interest today, we then demonstrate how these new developments can assist
in managing, harvesting, and organizing data on the Web. Along the way, we
also illustrate the current research avenues in these domains. We believe that
this effort would help bridge multiple database tracks, thereby attracting
researchers with a view to extending database technology.
Comment: EDBT - Tutorial (2011)
Connecting Width and Structure in Knowledge Compilation
Several query evaluation tasks can be done via knowledge compilation: the query result is compiled as a lineage circuit from which the answer can be determined. For such tasks, it is important to leverage some width parameters of the circuit, such as bounded treewidth or pathwidth, to convert the circuit to structured classes, e.g., deterministic structured NNFs (d-SDNNFs) or OBDDs. In this work, we show how to connect the width of circuits to the size of their structured representation, through upper and lower bounds. For the upper bound, we show how bounded-treewidth circuits can be converted to a d-SDNNF, in time linear in the circuit size. Our bound, unlike existing results, is constructive and only singly exponential in the treewidth. We show a related lower bound on monotone DNF or CNF formulas, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable). Specifically, any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth; and the same holds for pathwidth when compiling to OBDDs. Our lower bounds, in contrast with most previous work, apply to any formula of this class, not just a well-chosen family. Hence, for our language of DNF and CNF, pathwidth and treewidth respectively characterize the efficiency of compiling to OBDDs and (d-)SDNNFs, that is, compilation is singly exponential in the width parameter. We conclude by applying our lower bound results to the task of query evaluation
Computing Possible and Certain Answers over Order-Incomplete Data
This paper studies the complexity of query evaluation for databases whose
relations are partially ordered; the problem commonly arises when combining or
transforming ordered data from multiple sources. We focus on queries in a
useful fragment of SQL, namely positive relational algebra with aggregates,
whose bag semantics we extend to the partially ordered setting. Our semantics
leads to the study of two main computational problems: the possibility and
certainty of query answers. We show that these problems are respectively
NP-complete and coNP-complete, but identify tractable cases depending on the
query operators or input partial orders. We further introduce a duplicate
elimination operator and study its effect on the complexity results.
Comment: 55 pages, 56 references. Extended journal version of
arXiv:1707.07222. Up to the stylesheet, page/environment numbering, and
possible minor publisher-induced changes, this is the exact content of the
journal paper that will appear in Theoretical Computer Science
Online Influence Maximization (Extended Version)
Social networks are commonly used for marketing purposes. For example, free
samples of a product can be given to a few influential social network users (or
"seed nodes"), with the hope that they will convince their friends to buy it.
One way to formalize marketers' objective is through influence maximization (or
IM), whose goal is to find the best seed nodes to activate under a fixed
budget, so that the number of people who get influenced in the end is
maximized. Recent solutions to IM rely on the probability that one user
influences another. However, this probability information may be
unavailable or incomplete. In this paper, we study IM in the absence of
complete information on influence probability. We call this problem Online
Influence Maximization (OIM) since we learn influence probabilities at the same
time we run influence campaigns. To solve OIM, we propose a multiple-trial
approach, where (1) some seed nodes are selected based on existing influence
information; (2) an influence campaign is started with these seed nodes; and
(3) users' feedback is used to update influence information. We adopt the
Explore-Exploit strategy, which can select seed nodes using either the current
influence probability estimation (exploit), or the confidence bound on the
estimation (explore). Any existing IM algorithm can be used in this framework.
We also develop an incremental algorithm that can significantly reduce the
overhead of handling users' feedback information. Our experiments show that our
solution is more effective than traditional IM methods under partial
information.
Comment: 13 pages. To appear in KDD 2015. Extended version
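The three-step loop described in this abstract (select seeds, run a campaign, update estimates from feedback) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the independent-cascade simulation, the smoothed frequency estimate, the confidence-bound formula, the `explore_rate` parameter, and the degree-style `select_seeds` stand-in for "any existing IM algorithm" are all hypothetical choices made for this example.

```python
import math
import random

def select_seeds(graph, weights, budget):
    """Stand-in IM algorithm (assumption): rank nodes by total outgoing edge weight."""
    score = {u: sum(weights[(u, v)] for v in nbrs) for u, nbrs in graph.items()}
    return sorted(score, key=lambda u: (-score[u], u))[:budget]

def run_campaign(graph, true_prob, seeds, rng):
    """Simulate one independent-cascade campaign.

    Returns the set of activated nodes and, for every edge that was tried,
    whether the activation attempt succeeded (the users' feedback)."""
    active, frontier, feedback = set(seeds), list(seeds), {}
    while frontier:
        u = frontier.pop()
        for v in graph[u]:
            if v in active:
                continue
            success = rng.random() < true_prob[(u, v)]
            feedback[(u, v)] = success
            if success:
                active.add(v)
                frontier.append(v)
    return active, feedback

def oim(graph, true_prob, budget, trials, explore_rate=0.5, seed=0):
    """Multiple-trial loop: select seeds, run a campaign, update the estimates."""
    rng = random.Random(seed)
    hits = {e: 0 for e in true_prob}
    tries = {e: 0 for e in true_prob}
    total_influenced = 0
    for t in range(1, trials + 1):
        # Smoothed estimate of each edge's influence probability.
        mean = {e: (hits[e] + 1) / (tries[e] + 2) for e in true_prob}
        if rng.random() < explore_rate:
            # Explore: use an upper confidence bound on each estimate.
            weights = {
                e: min(1.0, mean[e] + math.sqrt(math.log(t + 1) / (tries[e] + 1)))
                for e in mean
            }
        else:
            # Exploit: use the current estimates directly.
            weights = mean
        seeds = select_seeds(graph, weights, budget)
        active, feedback = run_campaign(graph, true_prob, seeds, rng)
        total_influenced += len(active)
        for e, ok in feedback.items():
            tries[e] += 1
            hits[e] += ok
    estimates = {e: (hits[e] + 1) / (tries[e] + 2) for e in true_prob}
    return estimates, tries, total_influenced
```

Here `graph` maps every node (including nodes without outgoing edges) to its list of out-neighbours, and `true_prob` plays the role of the hidden influence probabilities that the loop only observes through campaign feedback.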
Provenance and Probabilities in Relational Databases: From Theory to Practice
We review the basics of data provenance in relational databases. We describe different provenance formalisms, from Boolean provenance to provenance semirings and beyond, that can be used for a wide variety of purposes, to obtain additional information on the output of a query. We discuss representation systems for data provenance, circuits in particular, with a focus on practical implementation. Finally, we explain how provenance is practically used for probabilistic query evaluation in probabilistic databases.
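To illustrate the step from Boolean provenance to provenance semirings, here is a minimal sketch that evaluates one fixed query, Q(x) :- R(x, y), S(y), over tuples annotated with semiring values. The `Semiring` class, the `join_project` function, and the choice of query are hypothetical names introduced for this example; they are not taken from the reviewed work or from any particular system.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]   # combines alternative derivations
    times: Callable[[Any, Any], Any]  # combines jointly used tuples

# Boolean provenance: is the answer derivable at all?
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
# Counting semiring: how many derivations does the answer have?
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

def join_project(r, s, sr):
    """Evaluate Q(x) :- R(x, y), S(y) with annotations in the semiring sr.

    r maps (x, y) tuples to annotations; s maps y values to annotations."""
    out = {}
    for (x, y), annotation in r.items():
        if y in s:
            joined = sr.times(annotation, s[y])
            out[x] = sr.plus(out.get(x, sr.zero), joined)
    return out
```

The same evaluation code runs unchanged for both semirings: joins multiply annotations and projection adds them, so only the `Semiring` instance decides whether the result records derivability or multiplicity.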
Archivage du Web
Web archiving is the process of collecting, selecting, enriching, storing, preserving, and making available information from today's Web so that it remains accessible to users in the future. The aim is to allow, for example, a historian thirty years from now to study how a political event was commented on by stakeholders, the media, and ordinary Web users; a judge five years from now to decide whether a given action violated the terms of use of a Web service as they were worded at the time of the events; or a sociologist twenty years from now to carry out a diachronic study of a community through the traces that community left on the Web.
An Experimental Study of the Treewidth of Real-World Graph Data
Treewidth is a parameter that measures how tree-like a relational instance is, and whether it can reasonably be decomposed into a tree. Many computation tasks are known to be tractable on databases of small treewidth, but computing the treewidth of a given instance is intractable. This article is the first large-scale experimental study of treewidth and tree decompositions of real-world database instances (25 datasets from 8 different domains, with sizes ranging from a few thousand to a few million vertices). The goal is to determine which data, if any, can benefit from the wealth of algorithms for databases of small treewidth. For each dataset, we obtain upper and lower bound estimations of its treewidth, and study the properties of its tree decompositions. We show in particular that, even when treewidth is high, using partial tree decompositions can result in data structures that can assist algorithms.
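As a rough illustration of what upper and lower bound estimations of treewidth can look like, the sketch below uses two standard heuristics: greedy minimum-degree elimination for the upper bound and graph degeneracy for the lower bound. Whether the study relies on these particular estimators is an assumption; the function names are hypothetical, and the input is assumed to be a symmetric adjacency mapping with comparable node labels.

```python
def treewidth_upper_bound(adj):
    """Greedy min-degree elimination: the largest neighbourhood seen when
    eliminating a vertex is the width of the induced tree decomposition."""
    g = {u: set(vs) for u, vs in adj.items()}
    width = 0
    while g:
        u = min(g, key=lambda x: (len(g[x]), x))
        nbrs = g.pop(u)
        width = max(width, len(nbrs))
        for v in nbrs:
            g[v].discard(u)
            g[v] |= nbrs - {v}  # fill-in: make the neighbours a clique
    return width

def treewidth_lower_bound(adj):
    """Degeneracy: repeatedly delete a minimum-degree vertex (no fill-in) and
    keep the largest minimum degree seen; treewidth is at least this value."""
    g = {u: set(vs) for u, vs in adj.items()}
    bound = 0
    while g:
        u = min(g, key=lambda x: (len(g[x]), x))
        bound = max(bound, len(g[u]))
        for v in g.pop(u):
            g[v].discard(u)
    return bound
```

Both functions copy the input graph before modifying it, so they can be called one after the other on the same adjacency mapping; the gap between the two returned values indicates how tightly the treewidth is pinned down.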